Abstract 

Deep learning has demonstrated the power of detailed modeling of complex high-order 
(multivariate) interactions in data. For some learning tasks there is power in learning 
models that are not only Deep but also Broad. By Broad, we mean models that incorporate 
evidence from large numbers of features. This is of especial value in applications where 
many different features and combinations of features all carry small amounts of information 
about the class. The most accurate models will integrate all that information. In this paper, 
we propose an algorithm for Deep Broad Learning called DBL. The proposed algorithm has 
a tunable parameter n that specifies the depth of the model. It provides straightforward 
paths towards out-of-core learning for large data. We demonstrate that DBL learns models 
from large quantities of data with accuracy that is highly competitive with the state-of- 
the-art. 

Keywords: Classification, Big Data, Deep Learning, Broad Learning, Discriminative-Generative 
Learning, Logistic Regression, Extended Logistic Regression 


1. Introduction 


The rapid growth in data quantity (Ganz and Reinsel, 2012) makes it increasingly difficult 
for machine learning to extract maximum value from current data stores. Most state-of- 
the-art learning algorithms were developed in the context of small datasets. However, the 
amount of information present in big data is typically much greater than that present in 
small quantities of data. As a result, big data can support the creation of very detailed 
models that encode complex higher-order multivariate distributions, whereas, for small data, 
very detailed models will tend to overfit and should be avoided (Brain and Webb, 2002; 
Martinez et al., 2015). We highlight this phenomenon in Figure 1. We know that the error 
of most classifiers decreases as they are provided with more data. This can be observed 
in Figure 1, where the variation in error-rate of two classifiers is plotted with increasing 
quantities of training data on the poker-hand dataset (Frank and Asuncion, 2010). One is a 
low-bias high-variance learner (KDB k = 5, taking into account quintic features (Sahami, 
1996)) and the other is a low-variance high-bias learner (naive Bayes, a linear classifier). 
For small quantities of data, the low-variance learner achieves the lowest error. However, 
as the data quantity increases, the low-bias learner comes to achieve the lower error as it 
can better model higher-order distributions from which the data might be sampled. 

Figure 1: Comparative study of the error committed by high- and low-bias classifiers on 
increasing quantities of data. 

The capacity to model different types of interactions among variables in the data is 
a major determinant of a learner’s bias. The greater the capacity of a learner to model 
differing distributions, the lower its bias will tend to be. However, many learners have 
limited capacity to model complex higher-order interactions. 

Deep learning has demonstrated some remarkable successes through its capacity to 
create detailed (deep) models of complex multivariate interactions in structured data (e.g., 
data in computer vision, speech recognition, bioinformatics, etc.). Deep learning can be 
characterized in several different ways. But the underlying theme is that of learning higher- 
order interactions among features using a cascade of many layers. This process is known as 
‘feature extraction’ and can be un-supervised as it leverages the structure within the data 
to create new features. Higher-order features are created from lower-order features creating 
a hierarchical structure. We conjecture that the deeper the model the higher the order of 
interactions that are captured in the data and the lower the bias that the model exhibits. 

We argue that in many domains there is value in creating models that are broad as well 
as deep. For example, when using web browsing history, or social network likes, or when 
analyzing text, it is often the case that each feature provides only extremely small amounts 
of information about the target class. It is only by combining very large amounts of this 
micro-evidence that reliable classification is possible. 

We call a model broad if it utilizes large numbers of variables. We call a model deep 
and broad if it captures many complex interactions each between numerous variables. For 
example, typical linear classifiers such as Logistic Regression (LR) and Naive Bayes (NB) 
are Broad Learners, in that they utilize all variables. However, these models are not deep, as 
they do not directly model interactions between variables. In contrast, Logistic Regression² 
with cubic features (LR^3) (Langford et al., 2007) and Averaged 2-Dependence Estimators 
(A2DE) (Webb et al., 2011; Zaidi and Webb, 2012), both of which consider all combinations 
of 3 variables, are both Deep and Broad. The parameters of the former are fit 
discriminatively through computationally intensive gradient descent-based search, while the 
parameters of the latter are fit generatively using computationally efficient 
maximum-likelihood estimation. This efficient estimation of A2DE parameters makes it 
computationally well-suited to big data. In contrast, we argue that LR^3's discriminative 
parameterization can more closely fit the data than A2DE, making it lower bias and hence 
likely to have lower error when trained on large training sets. However, optimizing the 
parameters of LR^3 becomes computationally intensive even on data of moderate 
dimensionality. 

1. Deep neural networks, convolutional deep neural networks, deep belief networks, etc. 

Recently, it has been shown that it is possible to form a hybrid generative-discriminative 
learner that exploits the strengths of both naive Bayes (NB) and Logistic Regression (LR) 
by creating a weighted variant of NB in which the weights are optimized using discriminative 
minimization of conditional log-likelihood (Zaidi et al., 2013, 2014). From one perspective, 
the resulting learner can be viewed as using weights to alleviate the attribute independence 
assumption of NB. From another perspective it can be seen to use the maximum likelihood 
parameterization of NB to pre-condition the discriminative search of LR. The result is 
a learner that learns models that are exactly equivalent to LR, but does so much more 
efficiently. 

In this work, we show how to achieve the same result with LR^n, creating a hybrid 
generative-discriminative learner named DBL^n for categorical data that learns equivalent 
deep broad models to those of LR^n, but does so more efficiently. We further demonstrate 
that the resulting models have low bias and have very low error on large quantities of 
data. However, to create this hybrid learner we must first create an efficient generative 
counterpart to LR^n. 

In short, the contributions of this work are: 


• developing an efficient generative counterpart to LR^n, named Averaged n-Join 
Estimators (AnJE), 

• developing DBL^n, a hybrid of LR^n and AnJE, 

• demonstrating that DBL^n has equivalent error to LR^n, but is more efficient, 

• demonstrating that DBL^n has low error on large data. 


2. Notation 

We seek to assign a value $y \in \Omega_Y = \{y_1, \ldots, y_C\}$ of the class variable $Y$ to a given example 
$\mathbf{x} = (x_1, \ldots, x_a)$, where the $x_i$ are value assignments for the $a$ attributes 
$\mathcal{A} = \{X_1, \ldots, X_a\}$. We define $\binom{\mathcal{A}}{n}$ as the set of all subsets of $\mathcal{A}$ of size $n$, 
where each subset in the set is denoted as $\alpha$: 

$$\binom{\mathcal{A}}{n} = \{\alpha \subseteq \mathcal{A} : |\alpha| = n\}.$$

We use $x_\alpha$ to denote the set of values taken by attributes in the subset $\alpha$ for any data object 
$\mathbf{x}$. 

LR for categorical data learns a weight for every attribute value per class. Therefore, 
for LR, we denote $\beta_y$ to be the weight associated with class $y$, and $\beta_{y,i,x_i}$ to be the weight 
associated with attribute $i$ taking value $x_i$ with class label $y$. For LR^n, $\beta_{y,\alpha,x_\alpha}$ specifies 
the weight associated with class $y$ and attribute subset $\alpha$ taking value $x_\alpha$. The equivalent 
weights for DBL^n are denoted by $w_y$, $w_{y,i,x_i}$ and $w_{y,\alpha,x_\alpha}$. 

The probability of attribute $i$ taking value $x_i$ given class $y$ is denoted by $P(x_i \mid y)$. 
Similarly, the probability of attribute subset $\alpha$ taking value $x_\alpha$ is denoted by $P(x_\alpha \mid y)$. 

2. Logistic Regression taking into account all n-level features is denoted by LR^n, e.g., LR^2, LR^3, LR^4, etc. 
take into account all quadratic, cubic, quartic, etc. features. 



3. Using generative models to precondition discriminative learning 


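There is a direct equivalence between a weighted NB and LR (Zaidi et al., 2013, 2014). We 
write LR for categorical features as: 

$$P_{\mathrm{LR}}(y \mid \mathbf{x}) = \exp\Big(\beta_y + \sum_{i=1}^{a} \beta_{y,i,x_i} - \log \sum_{c \in \Omega_Y} \exp\Big(\beta_c + \sum_{j=1}^{a} \beta_{c,j,x_j}\Big)\Big), \tag{1}$$

and NB as: 

$$P_{\mathrm{NB}}(y \mid \mathbf{x}) = \frac{P(y) \prod_{i=1}^{a} P(x_i \mid y)}{\sum_{c \in \Omega_Y} P(c) \prod_{i=1}^{a} P(x_i \mid c)}.$$

One can add the weights in NB to alleviate the attribute independence assumption, resulting 
in the WANBIA-C formulation, that can be written as: 

$$\begin{aligned}
P_{\mathrm{W}}(y \mid \mathbf{x}) &= \frac{P(y)^{w_y} \prod_{i=1}^{a} P(x_i \mid y)^{w_{y,i,x_i}}}{\sum_{c \in \Omega_Y} P(c)^{w_c} \prod_{j=1}^{a} P(x_j \mid c)^{w_{c,j,x_j}}} \\
&= \exp\Big(w_y \log P(y) + \sum_{i=1}^{a} w_{y,i,x_i} \log P(x_i \mid y) - \log \sum_{c \in \Omega_Y} \exp\Big(w_c \log P(c) + \sum_{j=1}^{a} w_{c,j,x_j} \log P(x_j \mid c)\Big)\Big).
\end{aligned} \tag{2}$$

When conditional log likelihood (CLL) is maximized for LR and weighted NB using 
Equations 1 and 2 respectively, we get an equivalence such that $\beta_c \propto w_c \log P(c)$ and 
$\beta_{c,i,x_i} \propto w_{c,i,x_i} \log P(x_i \mid c)$. Thus, WANBIA-C and LR generate equivalent models. While it might 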
seem less efficient to use WANBIA-C which has twice the number of parameters of LR, the 
probability estimates are learned very efficiently using maximum likelihood estimation, and 
provide useful information about the classification task that in practice serve to effectively 
precondition the search for the parameterization of weights to maximize conditional log 
likelihood. 
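
The equivalence above can be checked numerically. The following is a minimal sketch, assuming a hypothetical toy model with two classes and two binary attributes (all probabilities and weights below are illustrative, not taken from the paper). It evaluates the weighted NB form of Equation 2, maps its parameters to LR parameters using the equality case of the proportionality, beta = w log P, and evaluates the LR form of Equation 1.

```python
import math


def weighted_nb_posterior(prior, cond, x, w_class, w_attr):
    """P(y | x) for a weighted naive Bayes model in the form of Equation 2."""
    scores = {}
    for y in prior:
        s = w_class[y] * math.log(prior[y])
        for i, v in enumerate(x):
            s += w_attr[y][i][v] * math.log(cond[y][i][v])
        scores[y] = s
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return {y: math.exp(s - log_z) for y, s in scores.items()}


def lr_posterior(beta_class, beta_attr, x):
    """P(y | x) for categorical logistic regression in the form of Equation 1."""
    scores = {
        y: beta_class[y] + sum(beta_attr[y][i][v] for i, v in enumerate(x))
        for y in beta_class
    }
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return {y: math.exp(s - log_z) for y, s in scores.items()}


# Hypothetical toy model: class prior and per-attribute conditionals.
prior = {"a": 0.6, "b": 0.4}
cond = {
    "a": [{0: 0.7, 1: 0.3}, {0: 0.2, 1: 0.8}],
    "b": [{0: 0.1, 1: 0.9}, {0: 0.5, 1: 0.5}],
}
# Arbitrary illustrative weights on the log-probabilities.
w_class = {"a": 1.3, "b": 0.8}
w_attr = {
    "a": [{0: 0.5, 1: 1.5}, {0: 1.0, 1: 2.0}],
    "b": [{0: 0.9, 1: 1.1}, {0: 0.7, 1: 1.2}],
}

# Map weighted NB parameters onto LR parameters: beta = w * log P.
beta_class = {y: w_class[y] * math.log(prior[y]) for y in prior}
beta_attr = {
    y: [
        {v: w_attr[y][i][v] * math.log(p) for v, p in cond[y][i].items()}
        for i in range(len(cond[y]))
    ]
    for y in prior
}

x = (0, 1)
p_weighted = weighted_nb_posterior(prior, cond, x, w_class, w_attr)
p_lr = lr_posterior(beta_class, beta_attr, x)
```

Under these assumptions the two posteriors agree to floating-point precision, and setting every weight to 1 recovers plain NB.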
4. Deep Broad Learner (DBL) 


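In order to create an efficient and effective low-bias learner, we want to perform the same 
trick that is used by WANBIA-C for LR with higher-order categorical features. We define 
LR^n as: 

$$P_{\mathrm{LR}^n}(y \mid \mathbf{x}) = \frac{\exp\Big(\beta_y + \sum_{\alpha \in \binom{\mathcal{A}}{n}} \beta_{y,\alpha,x_\alpha}\Big)}{\sum_{c \in \Omega_Y} \exp\Big(\beta_c + \sum_{\alpha^* \in \binom{\mathcal{A}}{n}} \beta_{c,\alpha^*,x_{\alpha^*}}\Big)}. \tag{3}$$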


We do not include lower-order terms. For example, if n = 2 we do not include terms for 
$\beta_{y,i,x_i}$ as well as for $\beta_{y,i,x_i,j,x_j}$, because doing so does not increase the space of distinct 
distributions that can be modeled but does increase the number of parameters that must 
be optimized. 
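
To make the feature space concrete, here is a small sketch that enumerates the order-n features of a single example, one per attribute subset of size n, with no lower-order terms. The helper name `order_n_features` and the example values are hypothetical and used only for illustration.

```python
from itertools import combinations
from math import comb


def order_n_features(x, n):
    """Return one (attribute-index tuple, value tuple) pair per size-n attribute subset."""
    return [
        (idx, tuple(x[i] for i in idx))
        for idx in combinations(range(len(x)), n)
    ]


# Hypothetical example with a = 4 categorical attributes.
x = ("red", "small", "round", "soft")
pairs = order_n_features(x, 2)

# The number of features per example is the binomial coefficient C(a, n).
n_pairs_expected = comb(len(x), 2)
```

For a = 4 and n = 2 this yields six pairwise features per example; each distinct (subset, values, class) combination would carry its own weight in LR^n.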

To precondition this model using generative learning, we need a generative model of the 
form 


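$$\begin{aligned}
P(y \mid \mathbf{x}) &= \frac{P(y) \prod_{\alpha \in \binom{\mathcal{A}}{n}} P(x_\alpha \mid y)}{\sum_{c \in \Omega_Y} P(c) \prod_{\alpha^* \in \binom{\mathcal{A}}{n}} P(x_{\alpha^*} \mid c)} && (4) \\
&= \exp\Big(\log P(y) + \sum_{\alpha \in \binom{\mathcal{A}}{n}} \log P(x_\alpha \mid y) - \log \sum_{c \in \Omega_Y} \exp\Big(\log P(c) + \sum_{\alpha^* \in \binom{\mathcal{A}}{n}} \log P(x_{\alpha^*} \mid c)\Big)\Big). && (5)
\end{aligned}$$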


The only existing generative model of this form is a log-linear model, which requires 
computationally expensive conditional log-likelihood optimization and consequently would not be 
efficient to employ. It is not possible to create a Bayesian network of this form as it would 
require that $P(x_i, x_j)$ be independent of $P(x_j, x_k)$. However, we can use a variant of the 
AnDE (Webb et al., 2005, 2011) approach of averaging many Bayesian networks. Unlike 
AnDE, we cannot use the arithmetic mean, as we require a product of terms in Equation 4 
rather than a sum, so we must instead use a geometric mean. 


4.1 Averaged n-Join Estimators (AnJE) 

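Let $\mathcal{P}$ be a partition of the attributes $\mathcal{A}$. By assuming independence only between the sets 
of attributes $\alpha \in \mathcal{P}$ one obtains an n-joint estimator: 

$$P_{\mathrm{AnJE}}(\mathbf{x} \mid y) = \prod_{\alpha \in \mathcal{P}} P(x_\alpha \mid y).$$

For example, if there are four attributes $X_1$, $X_2$, $X_3$ and $X_4$ that are partitioned into the 
sets $\{X_1, X_2\}$ and $\{X_3, X_4\}$ then by assuming conditional independence between the sets 
we obtain 

$$P_{\mathrm{A2JE}}(x_1, x_2, x_3, x_4 \mid y) = P(x_1, x_2 \mid y) P(x_3, x_4 \mid y).$$

Let $\Psi^{\mathcal{A}}_n$ be the set of all partitions of $\mathcal{A}$ such that $\forall \mathcal{P} \in \Psi^{\mathcal{A}}_n\ \forall \alpha \in \mathcal{P}: |\alpha| = n$. For convenience we 
assume that $|\mathcal{A}|$ is a multiple of $n$. Let $\Upsilon^{\mathcal{A}}_n$ be a subset of $\Psi^{\mathcal{A}}_n$ that includes each set of $n$ 
attributes once. 

The AnJE model is the geometric mean of the set of n-joint estimators for the partitions 
in $\Upsilon^{\mathcal{A}}_n$. The AnJE estimate of conditional likelihood on a per-datum basis can be written as: 

$$\begin{aligned}
P_{\mathrm{AnJE}}(y \mid \mathbf{x}) &\propto P(y) P_{\mathrm{AnJE}}(\mathbf{x} \mid y) \\
&\propto P(y) \prod_{\alpha \in \binom{\mathcal{A}}{n}} P(x_\alpha \mid y)^{\frac{(n-1)!(a-n)!}{(a-1)!}}.
\end{aligned} \tag{6}$$

This is derived as follows. Each $\mathcal{P}$ is of size $s = a/n$. There are $\binom{a}{n}$ attribute-value n-tuples. 
Each must occur in exactly one partition, so the number of partitions must be 

$$p = \binom{a}{n} \Big/ s = \frac{(a-1)!}{(n-1)!(a-n)!}. \tag{7}$$

The geometric mean of all the AnJE models is thus 

$$P_{\mathrm{AnJE}}(\mathbf{x} \mid y) = \Big(\prod_{\mathcal{P} \in \Upsilon^{\mathcal{A}}_n} \prod_{\alpha \in \mathcal{P}} P(x_\alpha \mid y)\Big)^{1/p} = \prod_{\alpha \in \binom{\mathcal{A}}{n}} P(x_\alpha \mid y)^{\frac{(n-1)!(a-n)!}{(a-1)!}}. \tag{8}$$

Using Equation 6, we can write the log of $P(y \mid \mathbf{x})$ as: 

$$\log P_{\mathrm{AnJE}}(y \mid \mathbf{x}) \propto \log P(y) + \frac{(n-1)!(a-n)!}{(a-1)!} \sum_{\alpha \in \binom{\mathcal{A}}{n}} \log P(x_\alpha \mid y). \tag{9}$$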
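
The counting argument behind Equation 7 is easy to check numerically. The sketch below uses hypothetical helper names and computes the number of partitions p and the resulting per-subset exponent 1/p, assuming as in the text that the number of attributes a is a multiple of n.

```python
from math import comb, factorial


def num_partitions(a, n):
    """Number of partitions p needed so that every size-n subset occurs exactly once."""
    if a % n != 0:
        raise ValueError("a must be a multiple of n")
    # C(a, n) subsets in total, a / n subsets per partition.
    return comb(a, n) * n // a


def anje_weight(a, n):
    """Exponent placed on each P(x_alpha | y) by the geometric mean, i.e. 1 / p."""
    return factorial(n - 1) * factorial(a - n) / factorial(a - 1)


# Worked example: a = 4 attributes, n = 2.
p_pairs = num_partitions(4, 2)
w_pairs = anje_weight(4, 2)
```

With four attributes and n = 2 there are six attribute pairs, two per partition, so three partitions are needed and each pairwise probability receives the exponent 1/3.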


4.2 DBL^n 

It can be seen that AnJE is a simple model that places the same fixed weight, the exponent 
in Equation 6, on all feature subsets in the ensemble. The main advantage of this weighting 
scheme is that it requires no optimization, making AnJE learning extremely efficient. All that 
is required for training is to calculate the counts from the data. However, the disadvantage 
of AnJE is its inability to perform any form of discriminative learning. Our proposed 
algorithm, DBL^n, uses AnJE to precondition LR^n by placing weights on all probabilities in 
Equation 6 and learning these weights by optimizing the conditional likelihood.³ One can 
re-write AnJE models with this parameterization as: 

$$P_{\mathrm{DBL}}(y \mid \mathbf{x}) = \exp\Big(w_y \log P(y) + \sum_{\alpha \in \binom{\mathcal{A}}{n}} w_{y,\alpha,x_\alpha} \log P(x_\alpha \mid y) - \log \sum_{c \in \Omega_Y} \exp\Big(w_c \log P(c) + \sum_{\alpha^* \in \binom{\mathcal{A}}{n}} w_{c,\alpha^*,x_{\alpha^*}} \log P(x_{\alpha^*} \mid c)\Big)\Big). \tag{10}$$

3. One can initialize these weights with the AnJE weight (the exponent in Equation 6) for faster convergence. 
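Note that we can compute the likelihood and class-prior probabilities using either MLE or 
MAP. Therefore, we can write Equation 10 as: 

$$\log P_{\mathrm{DBL}}(y \mid \mathbf{x}) = w_y \log \pi_y + \sum_{\alpha \in \binom{\mathcal{A}}{n}} w_{y,\alpha,x_\alpha} \log \theta_{x_\alpha \mid y} - \log \sum_{c \in \Omega_Y} \exp\Big(w_c \log \pi_c + \sum_{\alpha^* \in \binom{\mathcal{A}}{n}} w_{c,\alpha^*,x_{\alpha^*}} \log \theta_{x_{\alpha^*} \mid c}\Big). \tag{11}$$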


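Assuming a Dirichlet prior, a MAP estimate of $P(y)$ is $\pi_y$, which equals: 

$$\pi_y = \frac{\#_y + m/|\mathcal{Y}|}{t + m},$$

where $\#_y$ is the number of instances in the dataset with class $y$, $t$ is the total number 
of instances, $|\mathcal{Y}|$ is the number of classes, and $m$ is the smoothing parameter. We will set 
$m = 1$ in this work. Similarly, a MAP estimate of $P(x_\alpha \mid y)$ is $\theta_{x_\alpha \mid y}$, which equals: 

$$\theta_{x_\alpha \mid y} = \frac{\#_{x_\alpha,y} + m/|x_\alpha|}{\#_y + m},$$

where $\#_{x_\alpha,y}$ is the number of instances in the dataset with class $y$ and attribute values $x_\alpha$, 
and $|x_\alpha|$ is the number of distinct value combinations of the attributes in $\alpha$. 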
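
As an illustration of this counting step, the sketch below computes smoothed estimates from raw counts with m = 1. The function names are hypothetical, and the smoothing term m / K in the conditional estimate, with K the number of possible value combinations for the attribute subset, is an assumption made by analogy with the class-prior formula.

```python
from collections import Counter


def map_class_prior(labels, classes, m=1.0):
    """Smoothed class prior: (count(y) + m / |classes|) / (t + m)."""
    t = len(labels)
    counts = Counter(labels)
    return {y: (counts[y] + m / len(classes)) / (t + m) for y in classes}


def map_conditional(count_xy, count_y, n_value_combinations, m=1.0):
    """Smoothed estimate of P(x_alpha | y).

    Assumption: the numerator is smoothed by m / K, where K is the number of
    possible value combinations of the attributes in alpha.
    """
    return (count_xy + m / n_value_combinations) / (count_y + m)


# Hypothetical toy data: four training labels over two classes.
labels = ["a", "a", "a", "b"]
pi = map_class_prior(labels, ["a", "b"])

# Hypothetical counts: the value pair was seen twice among 3 instances of the class,
# and the two binary attributes admit K = 4 value combinations.
theta = map_conditional(count_xy=2, count_y=3, n_value_combinations=4)
```

Because the estimates depend only on counts, this step needs a single pass over the data, which is what makes the generative stage cheap.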
DBL^n computes weights by optimizing CLL. Therefore, one can compute the gradient 
of Equation 11 with respect to the weights and rely on gradient descent based methods to find 
the optimal value of these weights. Since we do not want to be stuck in local minima, 
a natural question to ask is whether the resulting objective function is convex (Boyd and 
Vandenberghe, 2008). It turns out that the objective function of DBL^n is indeed convex. 
Roos et al. (2005) proved that an objective function of the form $\log P_{\mathcal{B}}(y \mid \mathbf{x})$, optimized 
by any conditional Bayesian network model, is convex if and only if the structure $\mathcal{G}$ 
of the Bayesian network $\mathcal{B}$ is perfect, that is, all the nodes in $\mathcal{G}$ are moral nodes. DBL^n is a 
geometric mean of several sub-models where each sub-model models $a/n$ interactions, each 
conditioned on the class attribute. Each sub-model has a structure that is perfect. Since 
the geometric mean of the sub-models becomes a weighted sum in log space, and a sum of 
convex functions is convex, one can see that DBL^n's optimization function will also lead to 
a convex objective function. 

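Let us first calculate the gradient of Equation 11 with respect to the weights associated with 
$\pi_y$. We can write: 

$$\frac{\partial \log P(y \mid \mathbf{x})}{\partial w_y} = \mathbf{1}_y \log \pi_y - \frac{\pi_y^{w_y} \prod_{\alpha} \theta_{x_\alpha \mid y}^{w_{y,\alpha,x_\alpha}}}{\sum_{c} \pi_c^{w_c} \prod_{\alpha^*} \theta_{x_{\alpha^*} \mid c}^{w_{c,\alpha^*,x_{\alpha^*}}}} \log \pi_y = (\mathbf{1}_y - P(y \mid \mathbf{x})) \log \pi_y, \tag{12}$$

where $\mathbf{1}_y$ denotes an indicator function that is 1 if the derivative is taken with respect to class 
$y$ and 0 otherwise. Computing the gradient with respect to the weights associated with $\theta_{x_\alpha \mid y}$ 
gives: 

$$\frac{\partial \log P(y \mid \mathbf{x})}{\partial w_{y,\alpha,x_\alpha}} = \mathbf{1}_y \mathbf{1}_\alpha \log \theta_{x_\alpha \mid y} - \frac{\pi_y^{w_y} \prod_{\alpha} \theta_{x_\alpha \mid y}^{w_{y,\alpha,x_\alpha}}}{\sum_{c} \pi_c^{w_c} \prod_{\alpha^*} \theta_{x_{\alpha^*} \mid c}^{w_{c,\alpha^*,x_{\alpha^*}}}} \mathbf{1}_\alpha \log \theta_{x_\alpha \mid y} = (\mathbf{1}_y - P(y \mid \mathbf{x})) \mathbf{1}_\alpha \log \theta_{x_\alpha \mid y}, \tag{13}$$

where $\mathbf{1}_\alpha$ and $\mathbf{1}_y$ denote indicator functions that are 1 if the derivative is taken with 
respect to attribute set $\alpha$ (respectively, class $y$) and 0 otherwise. 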
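
A quick way to sanity-check gradients of this form is to compare them with finite differences. The sketch below does so for a hypothetical two-class example with two active features; all numbers are illustrative. For the weights of an arbitrary class c, the gradient takes the form (1[c = y] - P(c | x)) times the corresponding log-probability, which reduces to Equations 12 and 13 when c is the true class y.

```python
import math


def posterior(log_pi, log_theta, w_class, w_feat):
    """Class posteriors P(c | x) under the weighted log-linear form of Equation 11."""
    scores = [
        w_class[c] * log_pi[c]
        + sum(w * lt for w, lt in zip(w_feat[c], log_theta[c]))
        for c in range(len(log_pi))
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return [math.exp(s - log_z) for s in scores]


def log_likelihood(y, log_pi, log_theta, w_class, w_feat):
    """log P(y | x) for the true class y."""
    return math.log(posterior(log_pi, log_theta, w_class, w_feat)[y])


def analytic_gradients(y, log_pi, log_theta, w_class, w_feat):
    """Gradients of log P(y | x) with respect to every class and feature weight."""
    post = posterior(log_pi, log_theta, w_class, w_feat)
    classes = range(len(log_pi))
    g_class = [((c == y) - post[c]) * log_pi[c] for c in classes]
    g_feat = [[((c == y) - post[c]) * lt for lt in log_theta[c]] for c in classes]
    return g_class, g_feat


def numeric_gradient_class(y, log_pi, log_theta, w_class, w_feat, c, eps=1e-6):
    """Central finite difference with respect to the class weight w_c."""
    up = list(w_class)
    down = list(w_class)
    up[c] += eps
    down[c] -= eps
    return (
        log_likelihood(y, log_pi, log_theta, up, w_feat)
        - log_likelihood(y, log_pi, log_theta, down, w_feat)
    ) / (2 * eps)


# Hypothetical example: log-priors and log-conditionals of the features active in x.
log_pi = [math.log(0.6), math.log(0.4)]
log_theta = [[math.log(0.5), math.log(0.2)], [math.log(0.3), math.log(0.7)]]
w_class = [1.0, 0.9]
w_feat = [[1.1, 0.8], [0.7, 1.2]]
y = 0

g_class, g_feat = analytic_gradients(y, log_pi, log_theta, w_class, w_feat)
g_numeric = [
    numeric_gradient_class(y, log_pi, log_theta, w_class, w_feat, c) for c in range(2)
]
```

The analytic and numeric values should agree up to finite-difference error; the same check can be applied to the feature weights by perturbing `w_feat` instead.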


4.3 Alternative Parameterization 

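Let us reparameterize DBL^n such that: 

$$\beta_y = w_y \log \pi_y, \quad \text{and} \quad \beta_{y,\alpha,x_\alpha} = w_{y,\alpha,x_\alpha} \log \theta_{x_\alpha \mid y}. \tag{14}$$

Now, we can re-write Equation 11 as: 

$$\log P_{\mathrm{LR}}(y \mid \mathbf{x}) = \beta_y + \sum_{\alpha \in \binom{\mathcal{A}}{n}} \beta_{y,\alpha,x_\alpha} - \log \sum_{c \in \Omega_Y} \exp\Big(\beta_c + \sum_{\alpha^* \in \binom{\mathcal{A}}{n}} \beta_{c,\alpha^*,x_{\alpha^*}}\Big). \tag{15}$$

It can be seen that this leads to Equation 3. We call this parameterization LR^n. 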

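Like DBL^n, LR^n also leads to a convex optimization problem, and, therefore, its 
parameters can also be optimized by simple gradient descent based algorithms. Let us compute 
the gradient of the objective function in Equation 15 with respect to $\beta_y$. In this case, we can 
write: 

$$\frac{\partial \log P(y \mid \mathbf{x})}{\partial \beta_y} = (\mathbf{1}_y - P(y \mid \mathbf{x})). \tag{16}$$

Similarly, computing the gradient with respect to $\beta_{y,\alpha,x_\alpha}$, we can write: 

$$\frac{\partial \log P(y \mid \mathbf{x})}{\partial \beta_{y,\alpha,x_\alpha}} = (\mathbf{1}_y - P(y \mid \mathbf{x})) \mathbf{1}_\alpha. \tag{17}$$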


4.4 Comparative analysis of DBL^n and LR^n 

It can be seen that the two models are actually equivalent and each is a re-parameterization 
of the other. However, there are subtle distinctions between the two. The most important 
distinction is the utilization of MAP or MLE probabilities in DBL^n. Therefore, DBL^n is a 
two step learning algorithm: 

• Step 1 is the optimization of the log-likelihood of the data (log P(y, x)) to obtain the 
estimates of the prior and likelihood probabilities. One can view this step as 
generative learning. 

• Step 2 is the introduction of weights on these probabilities and the learning of these 
weights by maximizing the CLL (P(y | x)) objective function. This step can be interpreted 
as discriminative learning. 

DBL^n employs generative-discriminative learning as opposed to only discriminative learning 
by LR^n. 

One can expect a similar bias-variance profile and very similar classification 
performance, as both models will converge to a similar point in the optimization space, the only 
difference in the final parameterization being due to the iterative descent being terminated 
before absolute optimization. However, the rate of convergence of the two models can be 
very different. Zaidi et al. (2014) show that for NB, such DBL^n-style parameterization 
with generative-discriminative learning can greatly speed up convergence relative to only 
discriminative training. Note that discriminative training with NB as the graphical model is 
vanilla LR. We expect to see the same trend in the convergence performance of DBL^n and 
LR^n. 

Another distinction between the two models becomes explicit if a regularization penalty 
is added to the objective function. One can see that in the case of DBL^n, regularizing 
parameters towards 1 will effectively pull them back towards the generative training estimates. 
For smaller datasets, one can expect to obtain better performance by using a large 
regularization parameter and pulling estimates back towards 1. However, one cannot do this 
for LR^n. Therefore, DBL^n models can very elegantly combine generative and discriminative 
parameters. 
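
The idea of regularizing towards 1 can be sketched with a quadratic penalty centred on a target value. The exact penalty used in the experiments is not spelled out here, so the form below (half the regularization constant times the squared distance from the target) and the function names are assumptions for illustration only.

```python
def l2_penalty_towards(weights, target, c):
    """Quadratic penalty 0.5 * c * sum((w - target)^2); an assumed, illustrative form."""
    return 0.5 * c * sum((w - target) ** 2 for w in weights)


def l2_penalty_gradient(weights, target, c):
    """Gradient of the penalty above with respect to each weight."""
    return [c * (w - target) for w in weights]


# With target = 1, weights equal to 1 (the purely generative model) incur no penalty.
penalty_at_generative = l2_penalty_towards([1.0, 1.0, 1.0], target=1.0, c=0.1)

# Weights that drift away from 1 are pulled back towards it.
pull = l2_penalty_gradient([1.5, 0.5], target=1.0, c=2.0)
```

Setting the target to 0 instead gives the usual shrinkage of an LR-style parameterization, which has no generative model to fall back on.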

An analysis of the gradient of DBL^n in Equations 12 and 13 and that of LR^n in 
Equations 16 and 17 also reveals an interesting comparison. We can write DBL^n's gradients in 
terms of LR^n's gradients as follows: 

$$\frac{\partial \log P(y \mid \mathbf{x})}{\partial w_y} = \frac{\partial \log P(y \mid \mathbf{x})}{\partial \beta_y} \log \pi_y,$$

$$\frac{\partial \log P(y \mid \mathbf{x})}{\partial w_{y,\alpha,x_\alpha}} = \frac{\partial \log P(y \mid \mathbf{x})}{\partial \beta_{y,\alpha,x_\alpha}} \log \theta_{x_\alpha \mid y}.$$

It can be seen that DBL^n has the effect of re-scaling LR^n's gradient by the log of the 
conditional probabilities. We conjecture that such re-scaling has the effect of pre-conditioning 
the parameter space and, therefore, will lead to faster convergence. 


5. Related Work 

Averaged n-Dependence Estimators (AnDE) is the inspiration for AnJE. An AnDE model 
is the arithmetic mean of all Bayesian Network Classifiers in each of which all attributes 
depend on the class and the same n attributes. A simple depiction of A1DE in graphical 
form is shown in Figure 2. There are $\binom{a}{n}$ possible combinations of attributes that can be 
used as parents, producing $\binom{a}{n}$ sub-models which are combined by averaging. 

AnDE and AnJE share two traits. First, both use simple generative learning, merely 
counting the relevant sufficient statistics from the data. Second, both have only one tuning 
parameter, n, that controls the bias-variance trade-off. Higher values of n lead to lower bias 
and higher variance, and vice versa. 

It is important not to confuse the equivalence (in terms of the level of interactions they 
model) of AnJE and AnDE models. That is, the following holds: 

f(A2JE) = f(A1DE), 

f(A3JE) = f(A2DE), 

f(AnJE) = f(A(n-1)DE), 

where f(.) is a function that returns the number of interactions that the algorithm models. 
Thus, an AnJE model uses the same core statistics as an A(n-1)DE model. At training 
time, AnJE and A(n-1)DE must learn the same information from the data. However, at 
classification time, each of these statistics is accessed once by AnJE and n times by 
A(n-1)DE, making AnJE more efficient. However, as we will show, it turns out that AnJE's use 
of the geometric mean results in a more biased estimator than the arithmetic mean 
used by AnDE. As a result, in practice, an AnJE model is less accurate than the equivalent 
AnDE model. 

However, due to the use of the arithmetic mean by AnDE, its weighted version would be 
much more difficult to optimize than AnJE, as transformed to log space it does not admit 
a simple linear model. 

Figure 2: Sub-models in an AnDE model with n = 2 and with four attributes. 


A work relevant to DBL^n is that of Greiner and Zhou (2002) and Greiner et al. (2004). The 
technique proposed in these papers, named ELR, has a number of traits in common with DBL^n. 
For example, the parameters associated with a Bayesian network classifier (naive Bayes and 
TAN) are learned by optimizing the CLL. Both ELR and DBL^n can be viewed as feature 
engineering frameworks. An ELR (let us say with TAN structure) model is a subset of 
DBL^2 models. The comparison of DBL^n with ELR is not the goal of this work. But in our 
preliminary results, DBL^n produces models of much lower bias than ELR (TAN). Modelling 
higher-order interactions is also an issue with ELR. One could learn a Bayesian network 
structure, create features based on it, and then use ELR. But several restrictions 
need to be imposed on the structure, that is, it has to fulfill the property of perfectness, 
to make sure that it leads to a convex optimization problem. With DBL^n, as we discussed 
in Section 4.2, there are no restrictions. Needless to say, ELR is neither broad nor deep. 
Some related ideas to ELR are also explored in Pernkopf and Bilmes (2005); Pernkopf and 
Wohlmayr (2009); Su et al. (2008). 

Several 


6. Experiments 

In this section, we compare and analyze the performance of our proposed algorithms and 
related methods on 77 natural domains from the UCI repository of machine learning (Frank 
and Asuncion, 2010). The experiments are conducted on the datasets described in Table 1. 
Domain                        Case     Att  Class  Domain                        Case  Att  Class 
Kddcup                        5209000  41   40     Vowel                         990   14   11 
Poker-hand                    1175067  10   10     Tic-Tac-ToeEndgame            958   10   2 
MITFaceSetC                   839000   361  2      Annealing                     898   39   6 
Covertype                     581012   55   7      Vehicle                       846   19   4 
MITFaceSetB                   489400   361  2      PimaIndiansDiabetes           768   9    2 
MITFaceSetA                   474000   361  2      Breast Cancer (Wisconsin)     699   10   2 
Census-Income(KDD)            299285   40   2      CreditScreening               690   16   2 
Localization                  164860   7    3      BalanceScale                  625   5    3 
Connect-4Opening              67557    43   3      Syncon                        600   61   6 
Statlog(Shuttle)              58000    10   7      Chess                         551   40   2 
Adult                         48842    15   2      Cylinder                      540   40   2 
LetterRecognition             20000    17   26     Musk1                         476   167  2 
MAGIC GammaTelescope          19020    11   2      HouseVotes84                  435   17   2 
Nursery                       12960    9    5      HorseColic                    368   22   2 
Sign                          12546    9    3      Dermatology                   366   35   6 
PenDigits                     10992    17   10     Ionosphere                    351   35   2 
Thyroid                       9169     30   20     LiverDisorders(Bupa)          345   7    2 
Pioneer                       9150     37   57     Primary Tumor                 339   18   22 
Mushrooms                     8124     23   2      Haberman’sSurvival            306   4    2 
Musk2                         6598     167  2      HeartDisease(Cleveland)       303   14   2 
Satellite                     6435     37   6      Hungarian                     294   14   2 
OpticalDigits                 5620     49   10     Audiology                     226   70   24 
PageBlocksClassification      5473     11   5      New-Thyroid                   215   6    3 
Wall-following                5456     25   4      GlassIdentification           214   10   3 
Nettalk(Phoneme)              5438     8    52     SonarClassification           208   61   2 
Waveform-5000                 5000     41   3      AutoImports                   205   26   7 
Spambase                      4601     58   2      WineRecognition               178   14   3 
Abalone                       4177     9    3      Hepatitis                     155   20   2 
Hypothyroid(Garavan)          3772     30   4      TeachingAssistantEvaluation   151   6    3 
Sick-euthyroid                3772     30   2      IrisClassification            150   5    3 
King-rook-vs-king-pawn        3196     37   2      Lymphography                  148   19   4 
Splice-junctionGeneSequences  3190     62   3      Echocardiogram                131   7    2 
Segment                       2310     20   7      PromoterGeneSequences         106   58   2 
CarEvaluation                 1728     8    4      Zoo                           101   17   7 
Volcanoes                     1520     4    4      PostoperativePatient          90    9    3 
Yeast                         1484     9    10     LaborNegotiations             57    17   2 
ContraceptiveMethodChoice     1473     10   3      LungCancer                    32    57   3 
German                        1000     21   2      Contact-lenses                24    5    3 
LED                           1000     8    10 

Table 1: Details of Datasets 


There are a total of 77 datasets: 40 datasets with fewer than 1000 instances, 21 datasets with 
between 1000 and 10000 instances, and 16 datasets with more than 10000 instances. There 
are 8 datasets with over 100000 instances. These datasets are shown in bold font in Table 1. 
Each algorithm is tested on each dataset using 5 rounds of 2-fold cross validation.⁴ 

We compare four different metrics, i.e., 0-1 Loss, RMSE, Bias and Variance.⁵ 

We report Win-Draw-Loss (W-D-L) results when comparing the 0-1 Loss, RMSE, bias 
and variance of two models. A two-tail binomial sign test is used to determine the 
significance of the results. Results are considered significant if p < 0.05. 
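
For readers who want to reproduce such p-values, here is a minimal sketch of a two-tailed binomial sign test. It assumes that draws are discarded and that wins and losses are equally likely under the null hypothesis; the function name is hypothetical.

```python
from math import comb


def sign_test_p(wins, losses):
    """Two-tailed binomial sign test p-value, ignoring draws (an assumption)."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a result at least as extreme as k in one tail under p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Example: 7 wins, 0 draws, 1 loss.
p_7_0_1 = sign_test_p(7, 1)
```

Under these assumptions a 7/0/1 record gives p = 18/256, approximately 0.070, so seven wins out of eight comparisons does not reach the 0.05 threshold.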

The datasets in Table 1 are divided into two categories. We call the following datasets 
Big: KDDCup, Poker-hand, USCensus1990, Covertype, MITFaceSetB, MITFaceSetA, 
Census-income, Localization. All remaining datasets are denoted as Little in the 
results. Due to their size, experiments for most of the Big datasets had to be performed in a 
heterogeneous environment (grid computing) for which CPU wall-clock times are not 
commensurable. In consequence, when comparing classification and training time, the 
following 9 datasets constitute the Big category: Localization, Connect-4, Shuttle, Adult, 
Letter-recog, Magic, Nursery, Sign, Pendigits. 


4. The exceptions are MITFaceSetA, MITFaceSetB and Kddcup, where results are reported with 2 rounds of 2-fold 
cross validation. 

5. As discussed in Section 1, the reason for performing bias/variance estimation is that it provides insights 
into how the learning algorithm will perform with varying amounts of data. We expect low variance 
algorithms to have relatively low error for small data and low bias algorithms to have relatively low error 
for large data (Brain and Webb, 2002). 
                   DBL^2 vs. A2JE         DBL^3 vs. A3JE 
                   W-D-L      p           W-D-L      p 
Little Datasets 
  Bias             66/4/5     <0.001      58/2/15    <0.001 
  Variance         16/3/56    <0.001      19/2/54    <0.001 
  0-1 Loss         42/5/28    0.119       37/3/35    0.906 
  RMSE             37/1/37    1.092       30/1/44    0.130 
Big Datasets 
  0-1 Loss         7/0/1      0.070       7/0/1      0.070 
  RMSE             7/0/1      0.070       7/0/1      0.070 

Table 2: Win-Draw-Loss: DBL^2 vs. A2JE and DBL^3 vs. A3JE. p is two-tail binomial sign test. 
Results are significant if p < 0.05. 


When comparing average results across Little and Big datasets, we normalize the results 
with respect to DBL^2 and present a geometric mean. 
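
The normalization step can be sketched as follows: each per-dataset result is divided by the baseline's result on the same dataset, and the ratios are combined with a geometric mean. The function name and the numbers are hypothetical, and the sketch assumes strictly positive values (a zero error would need special handling).

```python
import math


def normalized_geometric_mean(values, baseline):
    """Geometric mean of per-dataset ratios values[i] / baseline[i] (all must be > 0)."""
    ratios = [v / b for v, b in zip(values, baseline)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical error rates of a competitor and of the baseline on two datasets.
competitor_errors = [0.2, 0.1]
baseline_errors = [0.1, 0.1]
ratio = normalized_geometric_mean(competitor_errors, baseline_errors)
```

A value above 1 means the competitor is worse than the baseline on average, and the baseline itself always scores exactly 1.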

Numeric attributes are discretized by using the Minimum Description Length (MDL) 
discretization method (Fayyad and Irani, 1992). A missing value is treated as a separate 
attribute value and taken into account exactly like other values. 

We employed L-BFGS quasi-Newton methods (Zhu et al., 1997) for solving the optimization.⁶ 

We used a Random Forest that is an ensemble of 100 decision trees (Breiman, 2001). 

Both DBL^n and LR^n are L2 regularized. The regularization constant C is not tuned 
and is set to 10“^ for all experiments. 

The detailed 0-1 Loss and RMSE results on Big datasets are also given in Appendix 


6.1 DBL^n vs. AnJE 

A W-D-L comparison of the 0-1 Loss, RMSE, bias and variance of DBL^n and AnJE on Little 
datasets is shown in Table 2. We compare DBL^2 with A2JE and DBL^3 with A3JE only. It 
can be seen that DBL^n has significantly lower bias but significantly higher variance. The 
0-1 Loss and RMSE results are not in favour of either algorithm. However, on Big datasets, 
DBL^n wins on 7 out of 8 datasets in terms of both RMSE and 0-1 Loss. The results are 
not significant since the p value of 0.070 is greater than our set threshold of 0.05. One can infer 
that DBL^n successfully reduces the bias of AnJE, at the expense of increasing its variance. 

Normalized 0-1 Loss and RMSE results for both models are shown in Figure 3. It can 
be seen that DBL^n has a lower averaged 0-1 Loss and RMSE than AnJE. This difference 
is substantial when comparing on Big datasets. The training and classification time of 
AnJE is, however, substantially lower than that of DBL^n, as can be seen from Figure 4. This is to 
be expected as DBL^n adds discriminative training to AnJE and uses twice the number of 
parameters at classification time. 


6. The original L-BFGS implementation of (Byrd et al., 1995) from 
http://users.eecs.northwestern.edu/~nocedal/lbfgsb.html is used. 
Figure 3: Geometric mean of 0-1 Loss (Left), RMSE (Right) performance of DBL^2, A2JE, DBL^3 
and A3JE for Little and Big datasets. 

Figure 4: Geometric mean of Training Time (Left), Classification Time (Right) of DBL^2, A2JE, 
DBL^3 and A3JE for All and Big datasets. 


6.2 DBL^n vs. AnDE 

A W-D-L comparison of the 0-1 Loss, RMSE, bias and variance results of the two DBL^n 
models relative to the corresponding AnDE models is presented in Table 3. We compare 
DBL^2 with A1DE and DBL^3 with A2DE only. It can be seen that DBL^n has significantly 
lower bias and significantly higher variance than AnDE models. Recently, AnDE 
models have been proposed as fast and effective Bayesian classifiers when learning from 
large quantities of data (Zaidi and Webb, 2012). These bias-variance results make DBL^n a 
suitable alternative to AnDE when dealing with big data. The 0-1 Loss and RMSE results 
(with the exception of the RMSE comparison of DBL^3 vs. A2DE) are similar. 



                   DBL^2 vs. A1DE         DBL^3 vs. A2DE 
                   W-D-L      p           W-D-L      p 
Little Datasets 
  Bias             65/3/7     <0.001      53/5/17    <0.001 
  Variance         21/5/49    0.001       26/5/44    0.041 
  0-1 Loss         42/4/29    0.1539      39/3/33    0.556 
  RMSE             30/1/44    0.130       22/1/52    <0.001 
Big Datasets 
  0-1 Loss         8/0/0      0.007       7/0/1      0.073 
  RMSE             7/0/1      0.073       6/0/2      0.289 

Table 3: Win-Draw-Loss: DBL^2 vs. A1DE and DBL^3 vs. A2DE. p is two-tail binomial sign test. 
Results are significant if p < 0.05. 
Figure 5: Geometric mean of 0-1 Loss (Left) and RMSE (Right) performance of DBL^2, A1DE, 
DBL^3 and A2DE for Little and Big datasets. 

Figure 6: Geometric mean of Training Time (Left), Classification Time (Right) of DBL^2, A1DE, 
DBL^3 and A2DE for All and Big datasets. 


Normalized 0-1 Loss and RMSE are shown in Figure 5. It can be seen that the DBL^n
models have lower 0-1 Loss and RMSE than the corresponding AnDE models.

A comparison of the training time of DBL^n and AnDE is given in Figure 6. As expected,
due to its additional discriminative learning, DBL^n requires substantially more training time
than AnDE. However, AnDE does not share such a consistent advantage with respect to
classification time; the relative cost depends on the dimensionality of the data. For
high-dimensional data, the large number of permutations of attributes that AnDE must consider
results in greater computation.


6.3 DBL^n vs. LR^n


In this section, we compare the two DBL^n models with their equivalent LR^n models.
As discussed before, we expect to see a similar bias-variance profile and similar classification
performance, as the two models are re-parameterizations of each other.
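
The equivalence can be checked numerically. The sketch below assumes the simplest case
(n = 1), in which the DBL-style score multiplies each generative log-probability by a learned
weight, while the LR-style score uses one free parameter per class and attribute value. All
probabilities and weights are made-up toy values, and the function names are illustrative.

```python
import math
from itertools import product


def softmax(scores):
    """Turn per-class scores into a normalised posterior."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy generative estimates (assumed values): 2 classes, 2 binary attributes.
LOG_PRIOR = [math.log(0.6), math.log(0.4)]
# LOG_COND[y][i][v] = log P(x_i = v | y)
LOG_COND = [
    [[math.log(0.7), math.log(0.3)], [math.log(0.2), math.log(0.8)]],
    [[math.log(0.4), math.log(0.6)], [math.log(0.5), math.log(0.5)]],
]
# Toy discriminative weights (assumed values), one per generative estimate.
W_PRIOR = [1.2, 0.8]
W_COND = [
    [[0.9, 1.1], [1.3, 0.7]],
    [[1.0, 0.5], [0.6, 1.4]],
]


def dbl_posterior(x):
    """DBL-style posterior: weight times generative log-probability."""
    scores = []
    for y in range(2):
        s = W_PRIOR[y] * LOG_PRIOR[y]
        for i, v in enumerate(x):
            s += W_COND[y][i][v] * LOG_COND[y][i][v]
        scores.append(s)
    return softmax(scores)


# Re-parameterization: each LR parameter absorbs weight * log-probability.
BETA_PRIOR = [W_PRIOR[y] * LOG_PRIOR[y] for y in range(2)]
BETA_COND = [
    [[W_COND[y][i][v] * LOG_COND[y][i][v] for v in range(2)] for i in range(2)]
    for y in range(2)
]


def lr_posterior(x):
    """LR-style posterior: one free parameter per class and attribute value."""
    scores = [
        BETA_PRIOR[y] + sum(BETA_COND[y][i][v] for i, v in enumerate(x))
        for y in range(2)
    ]
    return softmax(scores)


# The two parameterizations give the same posterior for every input.
for x in product(range(2), repeat=2):
    assert all(
        abs(a - b) < 1e-12 for a, b in zip(dbl_posterior(x), lr_posterior(x))
    )
```

Because the mapping beta = w * log P carries any DBL-style solution over to an LR-style one,
the two forms can represent the same posteriors; what differs is the scaling of the parameter
space in which the optimizer works.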

We compare the two parameterizations in terms of the scatter of their 0-1 Loss and
RMSE values on Little datasets in Figures 7 and 9, respectively, and on Big datasets in
Figures 8 and 10, respectively. It can be seen that the two parameterizations (with the
exception of one dataset, wall-following) have a similar spread of 0-1 Loss and RMSE values
for both n = 2 and n = 3.

The comparative scatter of the number of iterations each parameterization takes to
converge is shown in Figures 11 and 12 for Little and Big datasets, respectively. It can be
seen that DBL^n requires far fewer iterations than LR^n. With a similar spread
of 0-1 Loss and RMSE values, it is very encouraging to see that DBL^n converges in far
fewer iterations. The number of iterations to converge plays a major part in determining
an algorithm's training time. The training time of the two parameterizations is shown in
Figures 13 and 14 for Little and Big datasets, respectively. It can be seen that DBL^n models
are much faster to train than the equivalent LR^n models.


Figure 7: Comparative scatter of 0-1 Loss of DBL^2 and LR^2 (Left) and DBL^3 and LR^3 (Right) for
Little datasets.


Figure 8: Comparative scatter of 0-1 Loss of DBL^2 and LR^2 (Left) and DBL^3 and LR^3 (Right) for
Big datasets.


Figure 9: Comparative scatter of RMSE of DBL^2 and LR^2 (Left) and DBL^3 and LR^3 (Right) for
Little datasets.


Figure 10: Comparative scatter of RMSE of DBL^2 and LR^2 (Left) and DBL^3 and LR^3 (Right) for
Big datasets.


Figure 11: Comparative scatter of number of iterations of DBL^2 and LR^2 (Left) and DBL^3 and
LR^3 (Right) for Little datasets.


A comparison of the rate of convergence of the Negative-Log-Likelihood (NLL) of the DBL^2 and
LR^2 parameterizations on some sample datasets is shown in Figure 15. It can be seen that
DBL^2 has a steeper curve, approaching its global minimum much faster. On almost all
datasets, DBL^2 follows a steeper, and hence more desirable, path
toward convergence. This is extremely advantageous when learning from very few iterations
(for example, when learning using Stochastic Gradient Descent based optimization) and,
therefore, is a desirable property for scalable learning. A similar trend can be seen in
Figure 16 for DBL^3 and LR^3.
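
For reference, the quantity tracked in these convergence plots can be computed as in the
short Python sketch below. It assumes that the NLL is the sum, over training examples, of
the negative log posterior probability of the true class; the function name and the toy
posteriors are illustrative and are not results from the experiments.

```python
import math


def negative_log_likelihood(posteriors, labels):
    """Sum of -log P(y_i | x_i) over the training examples.

    posteriors[i][c] is the model's probability of class c for example i,
    labels[i] is the index of the true class of example i.
    """
    return -sum(math.log(p[y]) for p, y in zip(posteriors, labels))


# Toy posteriors (assumed values) for three examples and two classes.
uniform = [[0.5, 0.5]] * 3
sharper = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
labels = [0, 1, 0]

nll_uniform = negative_log_likelihood(uniform, labels)  # equals 3 * log(2)
nll_sharper = negative_log_likelihood(sharper, labels)
```

A model that places more probability on the true classes has a lower NLL, so a curve that
drops faster against the number of iterations reaches a good fit in fewer optimizer steps.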


Finally, we compare the speed of convergence of DBL^n vs. LR^n as we increase n.
In Figure 17, we compare the convergence for n = 1, n = 2
and n = 3 on the sample Localization dataset. It can be seen that the improvement
that DBL^n provides over LR^n grows as we go to deeper structures, i.e., as n becomes
larger. Similar behaviour was observed for several datasets and, although studying rates
of convergence is a complicated matter and is outside the scope of this work, we anticipate
this phenomenon to be an interesting avenue of investigation for future work.


Figure 12: Comparative scatter of iterations of DBL^2 and LR^2 (Left) and DBL^3 and LR^3 (Right)
for Big datasets.


Figure 13: Comparative scatter of Training time of DBL^2 and LR^2 (Left) and DBL^3 and LR^3
(Right) for Little datasets.


Figure 14: Comparative scatter of Training time of DBL^2 and LR^2 (Left) and DBL^3 and LR^3
(Right) for Big datasets.


Figure 15: Comparison of rate of convergence of DBL^2 and LR^2 on several datasets. The X-axis
(No. of iterations) is on log scale.


[Panels, one per dataset, plotted against No. of Iterations; one panel is titled Satellite.]

Figure 16: Comparison of rate of convergence of DBL^3 and LR^3 on several datasets. The X-axis
(No. of iterations) is on log scale.


Figure 17: Comparison of rates of convergence of DBL^n vs. LR^n, for n = 1, 2, 3 on the sample
Localization dataset. The Y-axis is the negative log-likelihood; the X-axis is the No. of
Iterations.



                     DBL^2 vs. RF            DBL^3 vs. RF
                     W-D-L       p           W-D-L       p
Little Datasets
  Bias               51/3/21     <0.001      52/2/21     <0.001
  Variance           33/3/39     0.556       28/5/42     0.119
  0-1 Loss           40/3/32     0.409       37/3/35     0.906
  RMSE               26/1/48     0.014       27/1/47     0.026
Big Datasets
  0-1 Loss           5/0/3       0.726       6/0/2       0.289
  RMSE               5/0/3       0.726       5/0/3       0.726


Table 4: Win-Draw-Loss: DBL^2 vs. RF and DBL^3 vs. RF. p is two-tail binomial sign test. Results
are significant if p < 0.05.


6.4 DBL^n vs. Random Forest


The two DBL^n models are compared with Random Forest in terms of W-D-L of 0-1 Loss,
RMSE, bias and variance in Table 4. On Little datasets, it can be seen that DBL^n
has significantly lower bias than RF. DBL^3 has higher variance than RF on more datasets
than not (28/5/42), while the split is closer for DBL^2 (33/3/39); neither variance
difference reaches significance at p < 0.05. The 0-1 Loss results of
DBL^n and RF are similar. However, RF has better RMSE results than DBL^n on Little
datasets. On Big datasets, DBL^n wins on the majority of datasets in terms of both 0-1 Loss
and RMSE.

The averaged 0-1 Loss and RMSE results are given in Figure 18. It can be seen that
DBL^2, DBL^3 and RF have similar 0-1 Loss and RMSE across Little datasets. However,
on Big datasets, the lower bias of DBL^n results in much lower error than RF in terms of
both 0-1 Loss and RMSE. These averaged results also corroborate the W-D-L results
in Table 4, showing DBL^n to be a less biased model than RF.

The comparison of training and classification time of DBL^n and RF is given in Figure 19.
It can be seen that DBL^n models are worse than RF in terms of training time but better
in terms of classification time.


Figure 18: Geometric mean of 0-1 Loss (Left) and RMSE (Right) performance of DBL^2, DBL^3 and
RF for Little and Big datasets.


Figure 19: Geometric average of Training Time (Left) and Classification Time (Right) of DBL^2,
DBL^3 and RF for Little and Big datasets.


7. Conclusion and Future Work

We have presented an algorithm for deep broad learning. DBL consists of parameters that
are learned using both generative and discriminative training. To obtain the generative
parameterization for DBL, we first developed AnJE, a generative counterpart of higher-order
logistic regression. We showed that DBL^n and LR^n learn equivalent models, but that
DBL^n is able to exploit the information gained generatively to effectively precondition the
optimization process. DBL^n converges in fewer iterations, reaching its global minimum
much more rapidly and so training faster. We also compared DBL^n with the
equivalent AnJE and AnDE models and showed that DBL^n has lower bias than both AnJE
and AnDE models. We compared DBL^n with the state-of-the-art classifier Random Forest and
showed that DBL^n models have lower bias than RF and that on bigger datasets DBL^n
often obtains lower 0-1 Loss than RF.

There are a number of exciting new directions for future work. 

• We have shown that DBL^n is a low-bias classifier with minimal tuning parameters
and the ability to handle multiple classes. The obvious extension is to make it
out-of-core. We argue that DBL^n is well suited to stochastic gradient descent
based methods, as it can converge to the global minimum very quickly.

• It may be desirable to utilize a hierarchical DBL, such that hDBL^n = {DBL^1, ..., DBL^n},
incorporating all the parameters up to n. This may be useful for smoothing the
parameters. For example, if a certain interaction does not occur in the training data, at
classification time one can resort to lower values of n.

• In this work, we have constrained the values of n to two and three. Scaling up DBL^n
to higher values of n is highly desirable. One can exploit the fact that many
interactions at higher values of n will not occur in the data, and hence develop sparse
implementations of DBL^n models.

• Exploring other objective functions, such as Mean-Squared-Error or Hinge Loss, may
improve performance and is left as future work.

• The preliminary version of DBL that we have developed is restricted to categorical
data and hence requires that numeric data be discretized. While our results show that
this is often highly competitive with random forest using local cut-points, on some
datasets it is not. In consequence, there is much scope for investigation of deep broad
techniques for numeric data.

• DBL presents a credible path towards deep broad learning for big data. We have
demonstrated very competitive error on big data and expect future refinements to
deliver even more efficient and effective outcomes.
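
Two of the directions above, sparse storage of higher-order interactions and backing off to
lower n when an interaction is unseen, can be sketched together. The Python below is a
hypothetical illustration and is not part of the released DBL code: it stores only the
attribute-value combinations that occur in the data, and the names count_interactions and
highest_seen_order are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def count_interactions(rows, labels, max_n):
    """Count (class, attribute subset, values) triples for subset sizes 1..max_n.

    Only combinations that occur in the data get an entry, so the table
    stays sparse even when the space of possible interactions is huge.
    """
    counts = Counter()
    for row, y in zip(rows, labels):
        for n in range(1, max_n + 1):
            for attrs in combinations(range(len(row)), n):
                values = tuple(row[a] for a in attrs)
                counts[(y, attrs, values)] += 1
    return counts


def highest_seen_order(counts, row, y, attrs):
    """Back off: size of the largest sub-combination of attrs seen with class y."""
    for n in range(len(attrs), 0, -1):
        for sub in combinations(attrs, n):
            if (y, sub, tuple(row[a] for a in sub)) in counts:
                return n
    return 0


# Toy categorical data (assumed values): three rows, three attributes.
rows = [("a", "x", "p"), ("a", "y", "p"), ("b", "x", "q")]
labels = [0, 0, 1]
counts = count_interactions(rows, labels, max_n=2)

# The pair ("b", "y") never occurs with class 0, but the value "y" alone does,
# so the lookup backs off from order 2 to order 1.
order = highest_seen_order(counts, ("b", "y", "p"), 0, (0, 1))
```

A dictionary keyed by the observed combinations is only one possible sparse layout; hashing
or tries would serve the same purpose.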

8. Code and Datasets 

Code with running instructions can be downloaded from
https://www.dropbox.com/sh/iw33mgcku9m2quc/AABXwYewVtm0mVE6KoyMPEVFa?dl=0.

Appendix A. Detailed Results


In this appendix, we compare the 0-1 Loss and RMSE results of DBL^n, AnDE and RF.
The goal here is to assess the performance of each model on Big datasets. Therefore, only
results on the 8 big datasets are reported, in the tables that follow, for 0-1 Loss and RMSE
respectively. We also compare results with AnJE. Note that A1JE is naive Bayes. DBL^1 results
are also compared. Note that DBL^1 is WANBIA-C (Zaidi et al., 2013).

The best results are shown in bold font.